NSF PAR Search | NSF Public Access Repository


Scaling Parallel Algorithms to Massive Datasets using Multi-SSD Machines

https://doi.org/10.1145/3694906.3743308

Li, Haohong; Khan, Jamshed; Dhulipala, Laxman (July 2025, ACM)

Free, publicly-accessible full text available July 16, 2026
Fast, parallel, and cache-friendly suffix array construction

https://doi.org/10.1186/s13015-024-00263-5

Khan, Jamshed; Rubel, Tobias; Molloy, Erin; Dhulipala, Laxman; Patro, Rob (December 2024, Algorithms for Molecular Biology)

Full Text Available
Fulgor: a fast and compact k-mer index for large-scale matching and color queries

https://doi.org/10.1186/s13015-024-00251-9

Fan, Jason; Khan, Jamshed; Singh, Noor Pratap; Pibiri, Giulio Ermanno; Patro, Rob (January 2024, Algorithms for Molecular Biology)

Abstract The problem of sequence identification or matching (determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto (the strongest competitor in terms of index space vs. query time trade-off), Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2–6× faster to construct.
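The color-order idea in this abstract can be illustrated with a small sketch. It is not Fulgor's implementation: the class name `UnitigColorMap` and its methods are hypothetical, and a plain prefix-sum list stands in for the succinct rank structure that would give the o(1) overhead. It assumes unitigs are already sorted so that unitigs sharing a color are contiguous; one boundary bit per unitig then suffices, and a unitig's color ID is the number of set bits up to and including its position.

```python
from itertools import accumulate


class UnitigColorMap:
    """Illustrative map from unitig ID to color ID using one bit per unitig.

    Assumes unitigs are listed in color order (equal colors are contiguous).
    """

    def __init__(self, unitig_colors):
        colors = list(unitig_colors)
        # bits[i] == 1 exactly when unitig i starts a new color group
        # (the first unitig implicitly starts group 0, so its bit stays 0).
        self.bits = [
            int(i > 0 and colors[i] != colors[i - 1])
            for i in range(len(colors))
        ]
        # Prefix sums play the role of a rank structure here. A succinct
        # index would answer rank queries with o(n) extra bits instead.
        self._rank = list(accumulate(self.bits))

    def color_of(self, unitig_id):
        """Return the color-group ID of a unitig: rank of set bits up to it."""
        return self._rank[unitig_id]


# Six unitigs sorted by color: two with color 0, one with color 1, three with color 2.
color_map = UnitigColorMap([0, 0, 1, 2, 2, 2])
colors = [color_map.color_of(u) for u in range(6)]  # [0, 0, 1, 2, 2, 2]
```

Because every k-mer in a unitig shares the unitig's color, resolving a k-mer to its unitig ID is enough to reach its color through this map.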
Fast, Parallel, and Cache-Friendly Suffix Array Construction

https://doi.org/10.4230/LIPIcs.WABI.2023.16

Khan, Jamshed; Rubel, Tobias; Dhulipala, Laxman; Molloy, Erin; Patro, Rob (August 2023, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023))
Belazzougui, Djamal; Ouangraoua, Aïda (Ed.)
String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. In this paper we present CaPS-SA, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort. Due to its design, CaPS-SA has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. We show that despite its simple design, CaPS-SA outperforms existing state-of-the-art parallel SA and LCP-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context SA and show that CaPS-SA can easily be extended to exploit this structure to obtain further speedups.
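As a rough illustration of the bounded-context suffix array mentioned in this abstract, the sketch below sorts suffixes by only their first `context` characters. It is a naive comparison sort, not the samplesort-inspired CaPS-SA algorithm, and the function names are hypothetical. The assumption it relies on is the one stated in the abstract: queries have bounded length, so any pattern no longer than `context` can still be located by binary search.

```python
def bounded_context_suffix_array(text, context):
    """Order suffix start positions by their first `context` characters only.

    Suffixes that agree on that prefix keep their positional order (stable sort).
    """
    return sorted(range(len(text)), key=lambda i: text[i:i + context])


def find_occurrences(text, sa, pattern, context):
    """Locate all occurrences of `pattern`, valid only when len(pattern) <= context."""
    m = len(pattern)
    if m > context:
        raise ValueError("pattern is longer than the bounded context")

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


sa = bounded_context_suffix_array("banana", context=2)  # [5, 1, 3, 0, 2, 4]
hits = find_occurrences("banana", sa, "an", context=2)  # [1, 3]
```

Note that the result differs from the full suffix array of "banana" ([5, 3, 1, 0, 4, 2]): positions 1 and 3 both start with "an" and are not ordered any further, which is where a bounded-context construction can save work.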
Full Text Available
Fulgor: A Fast and Compact k-mer Index for Large-Scale Matching and Color Queries

https://doi.org/10.4230/LIPIcs.WABI.2023.18

Fan, Jason; Singh, Noor Pratap; Khan, Jamshed; Pibiri, Giulio Ermanno; Patro, Rob (August 2023, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023))
Belazzougui, Djamal; Ouangraoua, Aïda (Ed.)
The problem of sequence identification or matching - determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1 + o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2–6× faster to construct.
Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

https://doi.org/10.1186/s13059-022-02743-6

Khan, Jamshed; Kokot, Marek; Deorowicz, Sebastian; Patro, Rob (September 2022, Genome Biology)

Abstract The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this often forms a computational bottleneck. We present Cuttlefish 2, significantly advancing the state-of-the-art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58Tbp, from 4.5 days to 17–23 h; and it constructs the graph for 1.52Tbp white spruce reads in approximately 10 h, while the closest competitor requires 54–58 h, using considerably more memory.
Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections

https://doi.org/10.1093/bioinformatics/btab309

Khan, Jamshed; Patro, Rob (July 2021, Bioinformatics)
Abstract Motivation: The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem. Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using ∼84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using ∼126 GB of memory, and over 16 h using ∼289 GB of memory, respectively. Availability and implementation: Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available
An incrementally updatable and scalable system for large-scale sequence search using the Bentley–Saxe transformation

https://doi.org/10.1093/bioinformatics/btac142

Almodaresi, Fatemeh; Khan, Jamshed; Madaminov, Sergey; Ferdman, Michael; Johnson, Rob; Pandey, Prashant; Patro, Rob (March 2022, Bioinformatics)
Boeva, Valentina (Ed.)

Abstract Motivation: In the past few years, researchers have proposed numerous indexing schemes for searching large datasets of raw sequencing experiments. Most of these proposed indexes are approximate (i.e. with one-sided errors) in order to save space. Recently, researchers have published exact indexes (Mantis, VariMerge and Bifrost) that can serve as colored de Bruijn graph representations in addition to serving as k-mer indexes. This new type of index is promising because it has the potential to support more complex analyses than simple searches. However, in order to be useful as indexes for large and growing repositories of raw sequencing data, they must scale to thousands of experiments and support efficient insertion of new data. Results: In this paper, we show how to build a scalable and updatable exact raw sequence-search index. Specifically, we extend Mantis using the Bentley–Saxe transformation to support efficient updates, called Dynamic Mantis. We demonstrate Dynamic Mantis’s scalability by constructing an index of ≈40K samples from SRA by adding samples one at a time to an initial index of 10K samples. Compared to VariMerge and Bifrost, Dynamic Mantis is more efficient in terms of index-construction time and memory, query time and memory and index size. In our benchmarks, VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Dynamic Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost (VariMerge does not immediately support general search queries we require). Dynamic Mantis indexes were about 2.5× smaller than Bifrost’s indexes and about half as big as VariMerge’s indexes. Availability and implementation: Dynamic Mantis implementation is available at https://github.com/splatlab/mantis/tree/mergeMSTs. Supplementary information: Supplementary data are available at Bioinformatics online.
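For readers unfamiliar with the Bentley–Saxe transformation named in this abstract, a minimal generic sketch follows. It is not the Dynamic Mantis code: the class `BentleySaxeIndex` is hypothetical, and immutable sorted tuples stand in for static indexes. The technique keeps a sequence of static structures whose sizes are distinct powers of two; an insertion merges full levels upward like a carry in a binary counter, and a membership query checks every non-empty level.

```python
from bisect import bisect_left


class BentleySaxeIndex:
    """Insert-only dynamic membership index built from static sorted tuples."""

    def __init__(self):
        # levels[i] is either None or a sorted tuple holding exactly 2**i items.
        self.levels = []

    def insert(self, item):
        carry = [item]
        for i, level in enumerate(self.levels):
            if level is None:
                # Empty slot found: rebuild one static structure here and stop.
                self.levels[i] = tuple(sorted(carry))
                return
            # Level is full: absorb it into the carry and clear it.
            carry.extend(level)
            self.levels[i] = None
        # Every existing level was full, so open a new, larger level.
        self.levels.append(tuple(sorted(carry)))

    def __contains__(self, item):
        # Membership is decomposable: the answer is the OR over all levels.
        for level in self.levels:
            if level is not None:
                j = bisect_left(level, item)
                if j < len(level) and level[j] == item:
                    return True
        return False

    def level_sizes(self):
        return [len(level) if level is not None else 0 for level in self.levels]


index = BentleySaxeIndex()
for sample_id in range(5):
    index.insert(sample_id)
sizes = index.level_sizes()  # [1, 0, 4], i.e. 5 = 0b101
```

Each item is rebuilt into a larger level at most once per level, so an item takes part in O(log n) rebuilds overall, while a query pays for at most O(log n) static lookups.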
Research in Computational Molecular Biology: 27th Annual International Conference, RECOMB 2023, Istanbul, Turkey, April 16–19, 2023, Proceedings

Luo, Runpeng; Lin, Yu; Fan, Jason; Khan, Jamshed; Pibiri, Giulio Ermanno; Patro, Rob; Tabatabaee, Yasamin; Roch, Sébastien; Warnow, Tandy; Chandra, Ghanshyam; et al (April 2023, Springer Cham)
Tang, Haixu (Ed.)
This book constitutes the refereed proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2023, held in Istanbul, Turkey, from April 16–19, 2023. The 11 regular and 33 short papers presented in this book were carefully reviewed and selected from 188 submissions. The papers report on original research in all areas of computational molecular biology and bioinformatics.
Full Text Available
